NAR Genomics and Bioinformatics — Latest Matching Preprints

1

CuGen: A GPU-accelerated framework for large-scale genomics

Kiiskinen, T.; Richland, J.; Wang, W.; Lu, W. S.; Balasubramanian, N.; Hastie, T.; Tibshirani, R.; Rivas, M. A.

2026-07-17 genetic and genomic medicine 10.64898/2026.07.15.26358178 medRxiv

Top 2%

2.1%

Biobank-scale genomic analyses remain computationally expensive, CPU-bound workflows, particularly when adjusting for confounding. Here, we present CuGen, a GPU-accelerated framework for large-scale genomics. CuGen uses UltraLasso, a novel hierarchical application of univariate-guided sparse regression (uniLasso), to select a compact, phenotype-informed active set of fewer than 30,000 variants. This achieves robust leave-one-chromosome-out (LOCO) confounding control, enabling both downstream GWAS and in-sample fine-mapping. Additionally, we introduce the .cugen file format, a genotype representation designed for memory-optimized, high-throughput streaming and random access on GPU hardware. Building on this substrate, we provide a general GPU-accelerated genomics toolkit handling polygenic prediction, data manipulation, quality control, analysis, and visualization. We demonstrate CuGen's efficacy in the UK Biobank with up to 408,624 individuals, where the full GWAS pipeline and fine-mapping against 6.8 million imputed variants completes in approximately 10 minutes on a single high-throughput GPU with 80 GB of memory. The pipeline scales efficiently to massive phenome-wide analyses with sublinear resource consumption.

2

Reproducible-by-design: Romics Processor, a FAIR ecosystem for multi-omics and spatial-omics analysis

Gorman, B. L.; Bhotika, H.; Jehrio, M.; Purkerson, J. M.; Carlin, F.; Nakayasu, E. S.; Misra, R. S.; Adkins, J. N.; Anderton, C. R.; Pryhuber, G.; Clair, G. C.

2026-07-15 bioinformatics 10.64898/2026.07.09.737600 medRxiv

Top 2%

1.7%

Multi-omics and spatial-omics technologies are exploding in use, producing increasingly complex datasets. Existing bioinformatics tools are developing rapidly but fail to fully enforce the FAIR principles, leaving the field vulnerable to escalating issues in computational reproducibility. Here, we introduce a reproducible-by-design paradigm represented in an omics data processing package, RomicsProcessor. At its core is the "Romics_object", a self-contained digital artifact that encapsulates the full history of the data from the original data to the fully processed state, capturing the details of the transformative steps and the required dependencies. This architecture ensures that computational workflows are fully portable and reproducible. In this manuscript, we demonstrate RomicsProcessor's computational capabilities and scalability on diverse datasets, including bulk proteomics, large-scale multiplexed immunofluorescence, and multi-batch mass spectrometry imaging. Providing a robust framework for truly FAIR Data Principles-based analysis, RomicsProcessor is a blueprint for the next generation of reproducible bioinformatics tools that can dramatically accelerate discovery in multi-omics biology in the era of artificial intelligence.

3

Mapping Topic Change in Influential Hepatocellular Carcinoma Research: A Two-Cohort Bibliometric Analysis

Su, Z.; Li, T.

2026-07-16 oncology 10.64898/2026.07.07.26357427 medRxiv

Top 3%

1.5%

The therapeutic landscape for hepatocellular carcinoma (HCC) is evolving rapidly, necessitating scalable approaches to synthesize the expanding scientific literature. We characterized thematic shifts in HCC treatment and prognosis research by conducting a retrospective bibliometric analysis of influential publications from 2023 and 2024. Using the OpenAlex database, we identified the 50 most highly cited papers from each year based on eighteen-month post-publication citation counts. Large language models were deployed to extract, normalize, and classify concepts from unstructured text into canonical topics and parent themes, enabling quantitative year-over-year frequency comparisons. Analysis of these 100 papers revealed a distinct maturation in research focus. Although broad categories like general immunotherapy remained prevalent, their relative frequency declined in favor of specific dual immune checkpoint regimens, notably CTLA-4 inhibition and the durvalumab plus tremelimumab combination. Concurrently, parent themes related to radiomics, imaging, and health systems exhibited significant growth in the 2024 cohort. These findings demonstrate a thematic transition in high-impact HCC research from foundational immuno-oncology toward optimized combination therapies and precision diagnostics. Furthermore, this study highlights the utility of artificial intelligence-driven bibliometrics for objectively tracking dynamic conceptual shifts in oncology. A web interface for exploring the data is available at https://pri.pepkio.com/.

4

Hypertension Phenotypes in a National Database: A Three-Axis State Model Integrating Diagnosis, Treatment Intensity, and Blood Pressure Control (The NDB-K7Ps-Study-8)

Nakajima, K.; Sekine, A.

2026-07-19 cardiovascular medicine 10.64898/2026.07.16.26358276 medRxiv

Top 4%

0.9%

Hypertension is commonly defined as a binary condition despite substantial heterogeneity in diagnosis, treatment, and blood pressure (BP) control. We propose a three-axis state model integrating diagnosis status, treatment intensity, and BP control to better characterize hypertension phenotypes. The framework generates 27 possible states that can be condensed into seven clinically meaningful groups. We applied the model to 5,129,584 Japanese adults using the National Database of Health Insurance Claims and Specific Health Checkups. Hierarchical cluster analysis, sensitivity analysis excluding patients with cardiovascular diseases other than hypertension, and validation against antihypertensive medication use were performed. Overall, 64% of participants were classified as normotensive, whereas 36% belonged to hypertension-related groups, including 11% with unrecognized hypertension and 7% with diagnosed but untreated hypertension. Agreement with data-driven hierarchical cluster analysis was substantial (weighted κ=0.87). The group distribution remained largely unchanged in the sensitivity analysis, supporting the robustness of the proposed classification. Hypertension diagnosis also showed high validity, with a sensitivity of 96.5%, specificity of 91.8%, and substantial agreement with antihypertensive medication use (κ=0.78). This three-axis framework provides a robust and clinically interpretable approach for characterizing hypertension phenotypes, enabling systematic identification of care gaps and supporting research, clinical decision-making, and population health management.
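
The count of 27 states is consistent with three axes of three levels each (3 x 3 x 3). The abstract does not name the levels, so the sketch below uses placeholder labels (`d0`, `t0`, `c0`, and so on) that are purely hypothetical, and the three-levels-per-axis split is itself an assumption inferred from the total.

```python
from itertools import product

# Hypothetical placeholder levels: the abstract names the three axes
# but not their levels; three levels per axis is inferred from 27 = 3**3.
AXES = {
    "diagnosis": ("d0", "d1", "d2"),
    "treatment_intensity": ("t0", "t1", "t2"),
    "bp_control": ("c0", "c1", "c2"),
}

# Every combination of one level per axis is one state.
states = list(product(*AXES.values()))

print(len(states))
```

Condensing these states into the seven groups described in the abstract would need the authors' mapping, which is not given here.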

5

Privacy-Preserving Matching for Federated Causal Inference in Multicentre Patient Cohorts

Gusinow, R.; Morgan, A. S.; Canziani, L. M.; Zeitlin, J.; Kim, M.; Gentilotti, E.; Ghosn, J.; Florence, A.-M.; Tami, A.; Toschi, A.; Palacios-Baena, Z. R.; Tacconelli, E.; Hasenauer, J.

2026-07-19 epidemiology 10.64898/2026.07.16.26358171 medRxiv

Top 5%

0.6%

Causal effect estimates can often be biased in clinical and epidemiological studies as patient cohorts frequently exhibit substantial covariate imbalances between treated and control groups, often amplified in multicentre studies due to heterogeneous recruitment, clinical practice, and case mix. Covariate balancing methods are therefore essential for valid causal inference. However, their application becomes challenging when data are distributed across cohorts and cannot be pooled because of privacy, legal, or institutional constraints, leaving a gap in practical methods for causal effect estimation in federated and imbalanced clinical data settings. We develop a privacy-preserving framework for covariate balancing and causal effect estimation across distributed data providers, combining federated aggregation with differential privacy to enable propensity score subclassification and matching without sharing individual-level records. Matching relies on non-disclosive quantities and differentially private distance evaluation, and the resulting matched subsets remain local to each server. Balance can be assessed through federated diagnostics and privacy-preserving visualisations, and we provide secure estimators for average treatment effects with associated uncertainty quantification. We implement this framework in the DataSHIELD federated analysis platform via two R packages. In simulations, we demonstrate agreement between federated and centralised analyses in the absence of privacy noise and quantify the bias-variance trade-offs induced by differential privacy. We illustrate applicability in two multinational settings, a Long COVID cohort and very preterm birth cohorts, showing that the approach enables practical causal analyses under real-world data protection constraints. The DataSHIELD packages are available on GitHub. Additional methodological details are provided in the Supplementary Material.

6

Aggregating data to accelerate personalized therapy in heart failure (ADAPT-HF)

Roeder, C.; Goerg, C.; Talebi, A.; Stevens, L. M.; Scholtens, D. M.; Rasmussen-Torvik, L. P.; Alagna, L. M.; Shah, S. J.; Hall, J. L.; Das, A. K.; Jhund, P. S.; Kao, D. P.

2026-07-16 health informatics 10.64898/2026.07.13.26357501 medRxiv

Top 5%

0.6%

Background: Increased public access to data from disparate sources provides opportunities to study and validate predictive and subphenotype models in heterogeneous disease conditions using aggregated individual patient data. Robust, explicit, and transparent harmonization of data elements is critical to ensure interpretability, reproducibility, and generalizability of secondary and retrospective analyses. Methods & Results: We designed and implemented ADAPT (Aggregating Data to Accelerate Personalized Therapy), a scalable framework using multiple software packages (R, SQL, BigQuery) that enables rapid, explicit harmonization of structured data elements from randomized trials and observational studies using a standard spreadsheet interface. User-specified criteria are applied to primary study data to produce harmonized longitudinal datasets comprised of demographics, medical history, quantitative observations, repeated measures, and clinical outcomes. We demonstrate this functionality using 26 clinical studies found in the National Heart, Lung, and Blood Institute BioLINCC resource. We illustrate the scalability of ADAPT to the order of billions of datapoints using administrative clinical data in a cloud-computing platform. We also present examples of collaborators using ADAPT for independent harmonization tasks for secondary analyses and democratization of publicly available data. Conclusion: ADAPT is a disease-agnostic, extensible, and scalable platform to support robust, transparent harmonization of structured research data using interfaces accessible to a variety of researchers regardless of programming ability. It extends FAIR principles beyond research data to also represent harmonization analyses by improving Findability of harmonization decisions, Accessibility of methods to other stakeholders, Interoperability with independent analyses and datasets, and Reusability through efficient implementation in a variety of analysis environments.

7

Reliability-weighted target prioritization in CD4+ T-cell Perturb-seq: a generalizability-theory decomposition

Cheng, C.

2026-07-15 bioinformatics 10.64898/2026.07.13.738312 medRxiv

Top 6%

0.5%

Genome-scale Perturb-seq screens prioritize candidate targets by the strength of a perturbation's transcriptional effect. Effect strength does not answer a prior measurement question: is the readout dependable? A large effect estimated from a single guide, a single donor, or a pseudobulk of few cells need not survive replication, and for target prioritization each false lead costs a validation experiment. We treat each perturbation effect as a measurement in a crossed Target × Guide × Donor × Condition design and apply generalizability theory (Brennan, 2001; Cronbach et al., 1972) to separate the dependable part of an effect from facet-specific idiosyncrasy. Guides and donors enter as random facets; condition enters as a fixed facet and is analyzed within its levels. For each target we report a dependability profile over the facets and a joint generalizability coefficient over the two random facets, and we re-rank targets by effect magnitude weighted by that coefficient. On the released screen (Zhu et al., 2025), removing the measurement-error floor estimated from the non-targeting controls raises the number of genes with a dependable target-signal share above .10 from 40 to 7,674. Analyzed within activation states, dependability recovers the T-cell-receptor signaling module as reliably measurable only in activated cells, without recourse to gene annotation. A design study indicates that reliability is limited by the number of guides rather than the number of donors, so a future screen should add guides. Every methodological decision was recorded and adversarially reviewed, and all results regenerate from the released summary statistics.
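
The re-ranking idea can be sketched with the textbook generalizability coefficient for a two-facet crossed design, in which each error variance component is divided by the number of levels averaged over. This is a minimal illustration, not the preprint's implementation: the function names, variance components, effects, and coefficients below are all hypothetical.

```python
def g_coefficient(var_target, var_tg, var_td, var_resid, n_guides, n_donors):
    """Textbook generalizability coefficient for a target x guide x donor design.

    Guides and donors are random facets; each error component shrinks with
    the number of levels averaged over.
    """
    error = (
        var_tg / n_guides
        + var_td / n_donors
        + var_resid / (n_guides * n_donors)
    )
    return var_target / (var_target + error)


def reliability_weighted_rank(effects, coefficients):
    """Rank targets by |effect| multiplied by the reliability coefficient."""
    scores = {t: abs(effects[t]) * coefficients[t] for t in effects}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical variance components in which guide-related error dominates.
components = dict(var_target=1.0, var_tg=2.0, var_td=0.2, var_resid=1.0)
base = g_coefficient(**components, n_guides=2, n_donors=4)
more_guides = g_coefficient(**components, n_guides=4, n_donors=4)
more_donors = g_coefficient(**components, n_guides=2, n_donors=8)

# Hypothetical targets: "A" has the larger effect but is less dependable.
ranking = reliability_weighted_rank({"A": 2.0, "B": 1.5}, {"A": 0.2, "B": 0.6})
print(base, more_guides, more_donors, ranking)
```

With these made-up components, doubling the guides raises the coefficient more than doubling the donors, which is the kind of comparison a design study of this sort would make.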

8

LocusBlend: Flexible multi-index regional visualization of genomic association signals

Yang, C.; Cook, N.; Zeng, Y.; Fu, T.; Budde, J.; Cruchaga, C.; Belloy, M. E.

2026-07-21 genetic and genomic medicine 10.64898/2026.07.15.26358129 medRxiv

Top 7%

0.3%

Summary: It has become standard practice to visualize regional signals from genome-wide association studies (GWAS) using LocusZoom plots. Similarly, GWAS signals are compared to regionally matched quantitative trait loci (QTLs), i.e., variant-to-gene regulation data, using LocusCompare plots to aid assessment of candidate trait-related genes. Despite broad usage, these tools annotate variants by linkage disequilibrium (LD) to a single lead or index variant. This single-index representation has limitations for visualizing complex loci that contain multiple independent signals. We present LocusBlend, an interactive web application for multi-index, LD-blended visualization of genomic loci. LocusBlend supports one or two genomic association summary-statistic datasets and one to three index variants, multi-index LocusZoom color-blended plots, and matching LocusCompare visualizations. Applications to Alzheimer's disease GWAS and QTL signals illustrate that LocusBlend enables visualization and separation of independent signals despite shared LD and high genomic complexity. Overall, LocusBlend is aimed at helping researchers handle the continuously expanding complexity of human genomics findings. Availability and Implementation: LocusBlend is freely available at https://locusblend.wustl.edu. Publication-ready plots are generated in 1 min. Source code, documentation, example datasets, input templates, and reproducibility instructions are available at https://github.com/BelloyLab/LocusBlend. LocusBlend is implemented in Python using Streamlit, Plotly, and PLINK. Supplementary Information: Supplementary data are available online.

9

OTTR-CLASH: improved biochemical and bioinformatic identification of Argonaute 2-mediated microRNA-target RNA interactions

Kaufman, P. D.; Liu, H.; Hu, K.; Ferguson, L.; Collins, K.; Zhu, L. J.; Pederson, T.

2026-07-15 molecular biology 10.64898/2026.07.14.738487 medRxiv

Top 7%

0.3%

Various methods have detected miRNA-target interactions via immunoprecipitation of UV-crosslinked Argonaute ribonucleoprotein complexes, followed by intermolecular ligation of bound miRNAs to target strands, forming chimeric RNAs. To date, these methods have relied on conventional viral reverse transcriptases (RTs) to generate cDNAs for sequencing. However, crosslinked RNAs often retain adducts after purification, which can make them poor templates for viral RTs. Here, we adapted OTTR (Ordered Two-Template Relay) techniques to generate cDNAs from Ago2-bound RNAs. OTTR makes use of a modified retroelement-encoded RT, which is strongly processive even on templates with modifications or adducts. We show that this "OTTR-CLASH" method increases the frequency of generating chimeric RNAs compared to previous methods. We also developed an improved bioinformatic pipeline for analysis of these data, and we use this to catalog miRNA-target interactions not previously described in the literature.

10

PRANA: A Deep Learning Method for Adapting Polygenic Risk Scores to Diverse Ethnic Groups

Levi, H.; The Breast Cancer Association Consortium; Michailidou, K.; Elkon, R.; Shamir, R.

2026-07-15 genetic and genomic medicine 10.64898/2026.07.12.26357860 medRxiv

Top 8%

0.3%

Polygenic risk scores (PRSs), which quantify inherited susceptibility to complex traits and diseases, have emerged as valuable tools for risk stratification and precision medicine. Despite their promise, PRS developed on European cohorts often demonstrate substantially reduced predictive accuracy in non-European populations, due to differences in genetic architecture. The disproportionate representation of European ancestry cohorts in genome-wide association studies (GWAS) leads to inequitable deployment of PRS technologies across diverse populations. Here, we introduce PRANA (Polygenic Risk Adaptation via Neural-network Architecture), a deep learning framework that adapts an existing PRS developed on one population to other ancestries. Unlike methods that require large-scale GWAS in the target population, PRANA leverages pre-trained PRS models derived from European cohorts and adapts them using modestly sized cohorts from the target population. We evaluated PRANA on seven complex traits in South Asian, East Asian and Ashkenazi Jewish populations, as well as in selected smaller East Asian subpopulations where the scarcity of training data poses a particular challenge. PRANA mostly improved predictive performance of the baseline PRS models by 5%-20% in terms of effect size and Nagelkerke's R^2, and, in most cases, outperformed existing cross-ancestry multi-PRS approaches. These results highlight PRANA as a scalable and practical strategy to reduce disparities in genomic risk prediction and advance the equitable application of PRS in diverse populations.

11

Multi-tissue analyses of allele-specific chromatin accessibility nominate likely functional variants for type 2 diabetes

Narisu, N.; Li, H. X.; Rathbun, C. J. M.; Varshney, A.; Swift, A. J.; Yan, T.; Sinha, N.; Currin, K. W.; Xue, D.; Robertson, C. C.; Taylor, D. L.; Taylor, H. J.; Beck, A.; Lee, B. N.; Wang, L.; Broadaway, K. A.; Wilson, E. P.; Stringham, H.; Saramies, J.; Lakka, T. A.; Spracklen, C. N.; Scott, L. J.; Stitzel, M. L.; Tuomilehto, J.; Laakso, M.; Koistinen, H. A.; Boehnke, M.; Arda, H. E.; Chen, S.; Biesecker, L. G.; Bonnycastle, L. L.; Erdos, M. R.; Mohlke, K. L.; Parker, S. C. J.; Collins, F. S.

2026-07-15 health informatics 10.64898/2026.07.14.26358094 medRxiv

Top 8%

0.2%

Genome-wide association studies (GWAS) have identified >1,200 signals associated with type 2 diabetes (T2D), yet identifying functional variants remains challenging because the majority of them lie in noncoding regions of the genome and are in areas of high linkage disequilibrium (LD). While chromatin accessibility QTL (caQTL) and expression QTL (eQTL) analyses are useful for nominating regulatory mechanisms underlying GWAS signals, limitations still exist in pinpointing functional variants within regions of high LD. A complementary approach that has been less frequently applied is to focus on the allele-specific effect on chromatin accessibility at heterozygous single-nucleotide polymorphisms (SNPs), hereafter referred to as allelic imbalance. We analyzed the allelic imbalance of reads generated from an assay for transposase-accessible chromatin with sequencing (ATAC-seq) across genotyped samples from 490 donors in T2D-relevant tissues: skeletal muscle, liver, pancreatic islets, adipose tissue, and relevant cell types. We identified 119,949 allelically imbalanced SNPs (FDR<0.05) across the genome. The allelic imbalance was often most prominent in one tissue and showed an enrichment overlapping with tissue-specific transcription factor (TF) binding footprints. Focusing on the 8,581 SNPs in previously published 99% credible sets from 338 T2D GWAS signals, we identified 256 imbalanced SNPs across 123 (36.4% of) signals, each showing allelic imbalance in at least one tissue or cell type. Of these, 71 signals contained only a single imbalanced SNP, representing excellent candidate causative variants. As a proof-of-concept, we showed that 23 of the 256 imbalanced SNPs were supported by allelic assays from previous studies. Further, we experimentally validated two imbalanced SNPs as likely functional variants: rs34584161 among a seven-SNP T2D credible set at the RNF6 signal in islets and rs849134 among a 13-SNP credible set at the JAZF1 signal in liver. 
This study demonstrates the power of integrating ATAC-seq allelic imbalance (ASAI) with GWAS statistical fine-mapping to identify candidate functional regulatory variants from among tightly linked GWAS variants in disease-relevant tissues. While applied here in T2D, this approach represents a widely applicable high-throughput framework for refining the genetic architecture of complex traits.

12

The Variance-Stabilizing Transformation for the Poisson Rate Ratio: Closed-Form Confidence Intervals

Ng, S.-P.

2026-07-18 epidemiology 10.64898/2026.07.16.26358255 medRxiv

Top 9%

0.2%

The incidence rate ratio R is the standard measure for comparing event rates in clinical trials and epidemiology. In vaccine trials, the vaccine efficacy is VE = 1 - R. When events are rare, the two arm counts are Poisson. The estimator of R is heteroskedastic: its sampling variance changes with the data. So no fixed-width interval covers correctly everywhere. The usual log-Wald interval is undefined at zero events and covers poorly at small counts. Early vaccine and drug-safety readouts fall in exactly this regime. We show that a single reparameterization collapses this bivariate problem to an effective one-parameter family with a quadratic variance function, whose variance-stabilizing transformation is 2 arcsinh(sqrt(R)). The reduction yields a closed-form confidence interval for R. Its two leading errors, a curvature bias and the variability of the estimated scale, each admit a closed-form correction with no tuning constants. In a Monte Carlo study of our seven arcsinh variants and five competitors, the +Curve+Stu variant covers within 0.002 of the nominal 0.95 for about 50 control and 5 treatment events. Its width is on par with the best competitor. It avoids the conservatism and zero-count breakdown of log-Wald and MOVER. For moderate counts, we recommend this interval; for sparser data, our Bar-Lev and Enis count-shift variant is more robust. The result is a ready-to-use, closed-form interval for the low-count regime. We illustrate it on early Covid-19 vaccine-efficacy readouts and provide reference implementations in R and Python.
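
Two of the claims above can be checked numerically. The sketch below is not the preprint's reference implementation: it assumes equal person-time in both arms (so the estimated R is a ratio of counts) and assumes the quadratic variance function is proportional to R(1 + R), which is the form whose stabilizing transformation is 2 arcsinh(sqrt(R)). The function names are hypothetical. It shows the standard log-Wald interval failing at zero counts and confirms that the transform's slope times the assumed standard deviation is constant.

```python
import math


def vst(rate_ratio):
    """Variance-stabilizing transform named in the abstract: 2*asinh(sqrt(R))."""
    return 2.0 * math.asinh(math.sqrt(rate_ratio))


def log_wald_interval(x_treat, x_ctrl, z=1.959964):
    """Standard log-Wald interval for R, assuming equal person-time per arm."""
    if x_treat == 0 or x_ctrl == 0:
        # log(0) and 1/0 make the interval undefined at zero events.
        raise ValueError("log-Wald interval is undefined when a count is zero")
    log_r = math.log(x_treat / x_ctrl)
    se = math.sqrt(1.0 / x_treat + 1.0 / x_ctrl)
    return math.exp(log_r - z * se), math.exp(log_r + z * se)


def stabilized_sd(rate_ratio, h=1e-6):
    """Slope of vst times sqrt(R(1+R)); equals 1 if the variance is stabilized."""
    slope = (vst(rate_ratio + h) - vst(rate_ratio - h)) / (2.0 * h)
    return slope * math.sqrt(rate_ratio * (1.0 + rate_ratio))


lower, upper = log_wald_interval(5, 50)
print(lower, upper)
print([stabilized_sd(r) for r in (0.05, 0.5, 2.0)])
print(vst(0.0))  # finite at R = 0, unlike log(R)
```

The transform stays finite when the treatment arm has zero events, which is where the log scale breaks down; the closed-form interval and its corrections are described in the preprint itself.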

13

Storing >1 byte of information in 16S ribosomal RNA using orthogonal trans-splicing ribozymes

Dysart, M. J.; Fang, L.; Karinje, L. K.; Chappell, J.; Stadler, L. B.; Silberg, J. J.

2026-07-15 synthetic biology 10.64898/2026.07.14.738544 medRxiv

Top 10%

0.1%

Catalytic-RNA (cat-RNA) expressed from mobile DNA can record cellular events, such as the uptake of plasmids via horizontal gene transfer, by splicing a barcode onto 16S ribosomal RNA (rRNA) - a system termed RNA addressable modification (RAM). However, scaling RAM to record multiple simultaneous biological events requires large numbers of orthogonal cat-RNA whose signals reflect the biological features under investigation rather than variability arising from the barcode sequence. Here, we explore how to design orthogonal cat-RNA to record information about multiple plasmid-encoded traits in parallel. We show that cat-RNA having tRNA-derived barcodes with sequence variation in the anticodon stem-loop present greater signal consistency within Escherichia coli than mRNA-derived barcodes. When orthogonal cat-RNA designs harboring tRNA-derived barcodes were evaluated in Vibrio natriegens and Pseudomonas putida, increased variance was observed compared with Escherichia coli. Nevertheless, the signal consistency was sufficient to use these orthogonal cat-RNAs to report on the relative activities of four promoters and two origins of replication by sequencing barcoded-rRNA derived from the three organisms. These results show how RAM can be multiplexed to report on mobile DNA features in microbial communities and illustrate the importance of accounting for variability in RNA outputs when designing and interpreting multiplexed RNA barcoding data.

14

Early Feasibility of NIVA Score Decongestion Responsiveness: A Pilot Clinical and Preclinical Study

Alvis, B. D.; Schmeckpeper, J.; Rali, A. S.; Huston, J.; Tsai, S.; Amancherla, K.; Armstrong, D.; Gupta, R.; Whitfield, J. S.; Harder, R.; Miller, K.; Horne, M.; Wervey, D.; Pein, R.; Isanaka, T.; Case, M.; Wise, E.; Perrien, B.; Brophy, C.; Lindenfeld, J.; Hocking, K.

2026-07-15 cardiovascular medicine 10.64898/2026.07.13.26357910 medRxiv

Top 10%

0.1%

Residual congestion is the principal driver of heart failure readmission, and reliable serial assessment of volume status remains an unmet clinical need. This study asked whether a wrist-worn, machine-learning-based device for non-invasive venous waveform analysis in heart failure (the NIVAHF device), which produces an integer-scaled estimate of pulmonary capillary wedge pressure termed the NIVA Score, responds to acute changes in volume status. Agreement between the NIVA Score and invasively measured pulmonary capillary wedge pressure at single time points has been established in a separate prospective, multi-site study; however, such static agreement does not establish whether the measure tracks dynamic decongestion. We therefore evaluated the directional responsiveness of the locked NIVA Score in two prespecified cohorts: hospitalized adults with acute decompensated heart failure undergoing routine intravenous diuresis, and a controlled porcine model of volume overload followed by diuresis. In eleven patients contributing thirteen paired measurements (mean net fluid balance -2.1 ± 1.0 L), NIVA Scores decreased significantly after diuresis (paired t-test, P = 0.04). In five pigs contributing twenty-four paired measurements, NIVA Scores decreased significantly after intravenous furosemide following crystalloid loading (P < 0.01), and the direction of change was concordant with measured urine output in every animal. Statistical significance was reached in both cohorts despite modest sample sizes, indicating a measurable NIVA Score reduction with volume removal. In an exploratory analysis, the discharge NIVA Score yielded an area under the receiver-operating-characteristic curve of 0.85 (95% confidence interval 0.575-1.00; P = 0.04) for thirty-day readmission.
Together, the significant, directionally concordant NIVA Score reductions across independent clinical and preclinical cohorts demonstrate that the device tracks acute decongestion and support its use for serial, non-invasive congestion monitoring; an adequately powered prospective study is the planned next step.

15

Sex-different phenotypic correlations: Due to genes or environment?

Fritz, A.; Darrous, L.; Bonnelykke, K.; Pedersen, A. G.; Kutalik, Z.

2026-07-15 genetic and genomic medicine 10.64898/2026.07.13.26357694 medRxiv

Top 10%

0.1%

Differences in physical features and disease prevalence between men and women are examples of sexual dimorphisms. However, sex differences can manifest not only in trait means but also in how strongly risk factors are linked to diseases (e.g., BMI to cardiovascular disease), a question heavily under-researched. To fill this gap, we set out to identify sex differences in phenotype correlations (rP) and decompose them into genetic (rG) and environmental (rE) contributions. Our analysis revealed 250 trait pairs with significant sex-different phenotypic correlations in the UK Biobank. Overall, we observed a predominance of environmental contributions to sex-different effects: 182 trait pairs (73%) exhibited exclusively sex-different rE, while 68 (27%) showed sex differences in both rE and rG, and no trait pair was affected solely by sex-specific rG. For example, we detected sex-different environmental correlation between C-reactive protein and BMI (rE(men) = 0.07 vs rE(women) = 0.25), but no sex-difference in genetic correlation. On the contrary, glycated haemoglobin and LDL cholesterol showed genetic correlation only in women (rG(women) = 0.17; 95% CI = [0.1, 0.23]), but environmental correlation only in men (rE(men) = -0.18; 95% CI = [-0.19, -0.16]). Some of the observed sex differences - including those involving testosterone, SHBG, urate, waist-hip ratio, and triglycerides - may reflect underlying sex-specific genetic architectures, as evidenced by low between-sex genetic correlations. In conclusion, environmental factors are the predominant contributors to sex differences in phenotypic correlations between complex traits, with modest detectable contributions from sex-specific genetic architectures. Recognising these patterns can inform the development of more effective, sex-informed interventions.

16

From amplicon to antigen: a quantified transmission map that nominates multi-antigen antibody-drug-conjugate co-target sets across cancer types

Lam, J. M.; Walker-Samuel, S.; Pennycuick, A.

2026-07-16 oncology 10.64898/2026.07.13.26357987 medRxiv

Top 10%

0.1%

Somatic copy-number amplification is pervasive in cancer, and the genes it carries are candidate drug targets - but only those whose amplification is transmitted to accessible surface protein can be reached by an antibody-drug conjugate (ADC). We build an integrated map of copy-number-to-protein transmission across six tumour types and ask, for every amplified gene, whether its dosage reaches the surface. Copy number transmits to mRNA (median per-gene r = 0.21) but is attenuated at the protein level in 85% of genes, and the mRNA ranking is largely preserved to protein (rho = 0.70); the ranking is set principally at the chromatin/transcription step - among directly measured regulatory inputs, promoter DNA methylation and tumour chromatin accessibility each explain about an order of magnitude more of the transmission variance than gene structure, and do so complementarily. Critically, transmissibility is a stable, gene-intrinsic property: it is predictable from gene properties alone, with no proteomic input, at a leave-gene-out rank correlation of 0.52 (R^2 = 0.29); it is not positional (holding out whole chromosome arms changes accuracy by 0.001); and it transfers across lineages (Kendall W = 0.97 across leave-one-lineage-out refits). This licenses a predictor that nominates surface targets in cancer types that lack a tissue-referenced proteome, combining direct protein measurement where it is available with prediction where it is not.
Requiring co-elevation on a recurrent amplicon with measured transmissibility and an accessible extracellular ectodomain nominates 22 surface antigens on 18 distinct recurrent amplicons across four cancer types (renal, endometrial and both lung subtypes) - for example ITGB8+TSPAN13+TTYH3 on lung 7p, NCSTN+HSD17B7+MPZL1 on 1q (recurrent in several types), the transferrin receptor TFRC on squamous 3q, and FZD1 on clear-cell renal 7q; 21 of the 22 are non-driver passengers and 10 are confirmed on the experimental Cell Surface Protein Atlas. In single malignant cells, against a null that controls for per-cell sequencing depth, the co-detected constructs sit at a modest 1.05-1.45x above independence (p < 0.001, donor-block bootstrap intervals clear of 1.0), and at binding-relevant thresholds the normal-tissue co-expression collapses - so an avidity AND-gate that binds stably only where the antigens co-occur would spare normal cells that carry only one. Observed transmissibility itself transfers strongly between the two lung subtypes (rho = 0.88) and remains positive across distant lineages, consistent with the shared cell-of-origin regulation the map implies. Single-cell co-detection is demonstrated wherever a malignant single-cell atlas exists (both lung subtypes and glioblastoma - the latter entirely from prediction, using no GBM surface-abundance measurement); the remaining cohorts are nominated on the same genetic and topological evidence. The result is a pan-cancer, confidence-tiered catalogue of multi-antigen ADC co-target sets with a concrete plan to test them.
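
The "above independence" figure in this abstract can be read as an observed-over-expected co-detection ratio. The following is only a naive sketch of that idea: it assumes binary per-cell detection calls and does not implement the depth-controlled null or the donor-block bootstrap the abstract describes. The function name and the toy cells in the usage note are hypothetical.

```python
def codetection_ratio(detected):
    """Observed / expected fraction of cells in which every antigen is detected.

    `detected` is a list of per-cell tuples of booleans, one entry per antigen.
    The expectation assumes antigens are detected independently, so it is the
    product of the per-antigen detection fractions. A ratio above 1 means the
    antigens co-occur more often than independence would predict.
    """
    if not detected:
        raise ValueError("need at least one cell")
    n_cells = len(detected)
    n_antigens = len(detected[0])
    # Fraction of cells where all antigens are detected together.
    observed = sum(all(cell) for cell in detected) / n_cells
    # Product of marginal detection fractions (independence expectation).
    expected = 1.0
    for j in range(n_antigens):
        expected *= sum(bool(cell[j]) for cell in detected) / n_cells
    if expected == 0.0:
        raise ValueError("an antigen is never detected; ratio is undefined")
    return observed / expected
```

For eight toy cells in which two carry both antigens, one carries only the first and one carries only the second, the observed fraction is 2/8, the expectation is (3/8) * (3/8), and the ratio is 16/9, roughly 1.78.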

17

Scaling ECG Foundation Models and Identifying a Threshold for Effective Representation Learning

Sriram, R.; Nenadic, I.; Shahrabani, E.; Goonewardena, S.; Yao, S.; Farrell, B.; Loring, Z.; Murthy, V. L.

2026-07-17 cardiovascular medicine 10.64898/2026.07.15.26358182 medRxiv

Top 10%

0.1%

We conducted a scaling evaluation of unlabeled pretraining for electrocardiogram foundation model performance. One-dimensional vision transformer masked autoencoders were pretrained across increasing ECG volumes and fine-tuned for rhythm, morphology, diagnostic, and structural heart disease tasks. Models pretrained below 400,000 ECGs failed to consistently exceed controls without self-supervised pretraining, whereas 600,000 to 800,000 ECGs improved AUROC across tasks, suggesting a minimum threshold for effective ECG representation learning.

18

European-derived coronary artery disease polygenic scores over-flag genetic risk in Vietnamese and Southeast Asian populations: a multi-score analysis in 1000 Genomes

Hoang, Q. P.; Le, T. X.; Doan, D. D.

2026-07-15 genetic and genomic medicine 10.64898/2026.07.10.26357796 medRxiv

Top 10%

0.1%

Background. Polygenic scores (PRS) for coronary artery disease (CAD) are derived almost entirely from European-ancestry data. Their portability to Southeast Asian populations, including the Vietnamese, is largely uncharacterised and clinically consequential when scores are used with risk thresholds. Methods. We evaluated four independent European-derived CAD scores from the PGS Catalog (PGS000058, PGS000349, PGS002809, PGS004198; 70-5,723 variants) in 2,504 individuals from the 1000 Genomes Project, focusing on the Vietnamese Kinh (KHV) and Dai (CDX) samples. Per-individual scores were computed with PLINK2 and standardised. We assessed (i) the cross-ancestry distribution (calibration) and (ii) a clinically relevant consequence: the proportion of each population flagged as high genetic risk when the European top-20% threshold is applied (20% if perfectly calibrated). Results. For the primary score (PGS000058) the standardised PRS differed across super-populations (ANOVA F(4, 2499) = 121.1, p < 0.001); the Vietnamese Kinh mean was +0.47 SD above the European mean (Welch t = 7.77, p = 2.0 x 10^-14). Applying the European top-20% high-risk threshold, the fraction of Vietnamese Kinh flagged ranged from 22.2% to 57.6% across the four scores, and of Dai from 21.5% to 43.0%, versus the intended 20%. Three of the four scores over-flagged Vietnamese (25-58%); the largest score (PGS004198) was approximately calibrated for East/Southeast Asians (~22%) but markedly over-flagged Africans (69.3%). Conclusions. European-derived CAD polygenic scores are inconsistently calibrated in Vietnamese and other Southeast Asian samples, and most substantially over-flag high genetic risk when a European threshold is applied. The magnitude and even the direction of miscalibration depend on the specific score, so no such score can be assumed transferable without local validation and recalibration.
Distribution shift bounds, but does not by itself quantify, loss of predictive accuracy, which requires phenotyped data.
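
The over-flagging measure in this abstract (the share of a population exceeding the European top-20% threshold) can be sketched in a few lines. This is an illustration only, not the authors' pipeline: it assumes per-individual scores have already been computed and standardised, and the function names and toy scores in the usage note are hypothetical.

```python
def top_fraction_threshold(reference_scores, top_fraction=0.20):
    """Score above which the top `top_fraction` of the reference sample falls."""
    ordered = sorted(reference_scores)
    n_top = round(len(ordered) * top_fraction)
    if not 0 < n_top < len(ordered):
        raise ValueError("top_fraction leaves no usable threshold")
    # The largest score that is NOT in the reference top group.
    return ordered[len(ordered) - n_top - 1]


def fraction_flagged(reference_scores, target_scores, top_fraction=0.20):
    """Share of the target sample scoring above the reference threshold."""
    threshold = top_fraction_threshold(reference_scores, top_fraction)
    flagged = sum(1 for score in target_scores if score > threshold)
    return flagged / len(target_scores)
```

With toy reference scores 0 through 9, the threshold is 7 and exactly 20% of the reference sample is flagged. A target sample with the same spread but shifted up by 2 (scores 2 through 11) has 40% flagged, which is the kind of distribution-shift effect the abstract reports; as the abstract notes, this says nothing by itself about predictive accuracy.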

19

Natural genetic variation reveals divergent transcriptomic responses to hyperoxia in two Chlamydomonas reinhardtii ecotypes

Temple, J. A.; Neofotis, P. G.; Lucker, B. F.; Bibik, J. D.; Kramer, D. M.; Strenkert, D.

2026-07-15 genomics 10.64898/2026.07.09.737578 medRxiv

Top 11%

0.1%

Green algae must continuously balance resource availability to maintain photosynthetic performance. The O2:CO2 ratio is a key determinant of their metabolic mode. Under hyperoxia or low CO2, many algae induce a carbon concentrating mechanism (CCM). In the model green alga Chlamydomonas reinhardtii, the CCM relies on a pyrenoid, a specialized microcompartment that elevates CO2 around rubisco. While ambient CO2 acclimation is well studied, responses to hyperoxia remain poorly understood, despite its frequent occurrence in nature under high light. Using controlled bioreactors, we exposed two diverse Chlamydomonas ecotypes, CC1009 and CC2343, to 95% oxygen to analyze time-dependent, genome-wide transcriptomic and phenotypic changes. Both ecotypes induced CCM genes, but they exhibited distinct molecular and physiological phenotypes. The tolerant ecotype (CC1009) successfully adapted, developing a functional CCM with a structured starch sheath. Conversely, the sensitive ecotype (CC2343) suffered growth arrest and formed malformed pyrenoids. Transcriptomics revealed that CC1009 mounted a rapid initial response, upregulating chloroplast proteostasis and downregulating nucleotide metabolism. CC2343 showed a massive, delayed transcriptional response, downregulating genes coding for photosystems and tetrapyrrole biosynthesis. This unbiased transcriptomic approach identifies key candidate genes driving algal acclimation to hyperoxic stress in natural, high-light environments.

20

Single-cell gene programs define subtype identity and metastatic trajectories in renal cell carcinoma

Madrigal, A.; Kim, M.; Mehrjoo, Z.; Nishimura, T.; Saatci, O.; Osakwe, A.; Zavacky, E.; Moslemi, E.; Glennon, K. I.; Dankner, M.; Maritan, S. M.; Kuasne, H.; Pilon, V.; Monast, A.; Soytas, M.; Arseneault, M.; Oikonomopoulos, S.; Harutyunyan, A.; Lu, T.; Rayes, R.; Soto, L. M.; Hernandez-Corchado, A.; Spicer, J. D.; Petrecca, K.; Siegel, P.; Park, M.; Ragoussis, J.; Sahin, O.; Brimo, F.; Tanguay, S.; Riazalhosseini, Y.; Najafabadi, H. S.

2026-07-16 genetic and genomic medicine 10.64898/2026.07.14.26357682 medRxiv

Top 11%

0.1%

While extensive cellular heterogeneity in renal cell carcinomas (RCC) is linked to diverse clinical outcomes, our understanding of this diversity is limited to variation driven by clonal patterns or activity of canonical pathways. Here, we present a compendium of over 85,000 single-cell gene expression profiles from primary and metastatic tumors as well as patient-derived models across four RCC subtypes, including the rare clear cell papillary renal cell tumors, which we show are often misclassified and for which we identify CASP14 as a highly sensitive and specific biomarker. We dissect malignant cell variation within and across tumors using a generative modeling framework that accounts for clonal and copy number-driven expression shifts, defining 59 gene expression programs that deconstruct canonical pathways into functional submodules with divergent activity patterns, distinct regulators, and differential association with clinical outcomes. Despite the canonical view that VHL-deficient clear cell RCC exists in a constitutive pseudohypoxic state, we show strong intra-tumor variability of a hypoxia inducible factor 2 (HIF2)-driven program linked to poor outcome. We also identify early, spatially organized activation of a complete epithelial-to-mesenchymal transition (EMT) program, loss of epithelial identity, and upregulation of protein translation programs as key characteristics of metastatic progression. Finally, a metastatic signature capturing cellular de-differentiation and translational activity identifies primary tumors associated with adverse clinical outcomes. Together, this resource establishes a framework for dissecting malignant cell heterogeneity, refines RCC subtype classification, and defines transcriptional programs underlying metastatic progression.